Multiscale Analysis of Document Corpora Based on Diffusion Models

Authors

  • Chang Wang
  • Sridhar Mahadevan
Abstract

We introduce a nonparametric approach to multiscale analysis of document corpora using a hierarchical matrix analysis framework called diffusion wavelets. In contrast to eigenvector methods, diffusion wavelets construct multiscale basis functions. In this framework, a hierarchy is constructed automatically by an iterative series of dilation and orthogonalization steps, beginning with an initial set of orthogonal basis functions such as the unit-vector bases. The basis functions at each level are constructed from the bases at the level below by dilation using the dyadic powers of a diffusion operator. A novel aspect of our work is that the diffusion analysis is conducted on the space of variables (words) rather than instances (documents). The approach automatically and efficiently determines both the number of levels of the topical hierarchy and the topics at each level. Multiscale analysis of document corpora is achieved by projecting the documents onto the spaces spanned by the basis functions at different levels. Further, when the input term-term matrix is a “local” diffusion operator, the algorithm runs in time approximately linear in the number of non-zero elements of the matrix. The approach is illustrated on various data sets, including NIPS conference papers, 20 Newsgroups, and TDT2 data.
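The dilation and orthogonalization loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a dense SVD for the orthogonalization step where a sparse rank-revealing QR would normally be used, it builds only the scaling-function bases (not the wavelet functions), and the function names, the symmetric normalization, the precision `eps`, and the toy counts are all hypothetical choices made for the example.

```python
import numpy as np


def diffusion_operator(term_term):
    """Symmetrically normalize a nonnegative term-term matrix W into
    T = D^{-1/2} W D^{-1/2}, where D holds the row sums of W (an assumed choice)."""
    W = np.asarray(term_term, dtype=float)
    d = W.sum(axis=1)
    d[d == 0] = 1.0  # guard against terms with no co-occurrences
    s = 1.0 / np.sqrt(d)
    return s[:, None] * W * s[None, :]


def multiscale_bases(T, eps=1e-6, max_levels=10):
    """Return one basis per level; columns are basis functions over the words.

    Level 0 is the unit-vector basis. At level j, `op` represents T^(2^j)
    in the coordinates of the current basis.
    """
    n = T.shape[0]
    basis = np.eye(n)
    op = np.array(T, dtype=float)
    bases = [basis]
    for _ in range(max_levels):
        # Dilation: the columns of `op` are the dilated basis functions.
        # Orthogonalization: keep an orthonormal basis of their range, to
        # precision eps (SVD here stands in for a rank-revealing QR).
        U, s, _ = np.linalg.svd(op)
        k = int(np.sum(s > eps))
        Q = U[:, :k]
        basis = basis @ Q          # new basis, expressed over the words
        op = Q.T @ op @ op @ Q     # next dyadic power, in the new basis
        bases.append(basis)
        if k <= 1:                 # nothing left to coarsen
            break
    return bases


def project_documents(doc_term, bases):
    """Coordinates of each document (row) in the basis of every level."""
    X = np.asarray(doc_term, dtype=float)
    return [X @ B for B in bases]


# Hypothetical toy corpus: 4 documents over 6 terms (counts are made up).
doc_term = np.array([[2, 1, 1, 0, 0, 0],
                     [1, 2, 0, 0, 0, 1],
                     [0, 0, 0, 2, 1, 1],
                     [0, 0, 1, 1, 2, 0]], dtype=float)
bases = multiscale_bases(diffusion_operator(doc_term.T @ doc_term))
print([B.shape[1] for B in bases])  # basis size per level
```

In this sketch the number of levels is not fixed in advance: the loop stops once the retained subspace has shrunk to a single function, with `max_levels` as a safety cap. The document coordinates returned by `project_documents` get coarser from one level to the next.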

Related resources

Multiscale Analysis of Data Sets with Diffusion Wavelets

Analysis of functions of manifolds and graphs is essential in many tasks, such as learning, classification, clustering. The construction of efficient decompositions of functions has till now been quite problematic, and restricted to few choices, such as the eigenfunctions of the Laplacian on a manifold or graph, which has found interesting applications. In this paper we propose a novel paradigm...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Persian Printed Document Analysis and Page Segmentation

This paper presents a hybrid method, combining low-resolution and high-resolution stages, for Persian page segmentation. In the low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis, and the document image is segmented into a set of regions. In the high-resolution page segmentation, each region is segmented by connected-components analysis into homogeneous regions and identifyi...

Multiscale Analysis of Transverse Cracking in Cross-Ply Laminated Beams Using the Layerwise Theory

A finite element model based on the layerwise theory is developed for the analysis of transverse cracking in cross-ply laminated beams. The numerical model is developed using the layerwise theory of Reddy, and the von Kármán type nonlinear strain field is adopted to accommodate the moderately large rotations of the beam. The finite element beam model is verified by comparing the present numeric...

A Generic Analysis of the conclusion section of Research Articles in the field of sociology: A Comparative study

This paper reported on a genre-driven comparative study, which aimed to identify the generic moves in the conclusion sections of twenty research articles in the field of sociology written in the two codes of Persian and English. To meet this purpose, the researchers employed Moritz, Meurer, and Dellagnelo's model, which was set within the Swalesian framework of genre analysis. The analysis was ...

Publication date: 2009